Add new script hocr-cut for cutting a page #108
This cuts a page (horizontally) into two pages in the middle so that most of the bounding boxes are separated cleanly, e.g. for cutting double pages or double columns.
Signed-off-by: Stefan Weil <[email protected]>
It was fixed using `yapf -i --style pep8 hocr-cut`. Signed-off-by: Stefan Weil <[email protected]>
Tesseract uses image names enclosed in "" which must be stripped because otherwise opening the image will fail. Signed-off-by: Stefan Weil <[email protected]>
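The quote handling described in this commit can be illustrated with a minimal sketch. It assumes the image name is read from the `image` property in the `title` attribute of an hOCR `ocr_page` element; the function name `image_name_from_title` is hypothetical and not taken from `hocr-cut`, and the script itself may handle this differently.

```python
def image_name_from_title(title):
    """Return the image file name from an hOCR page title, without quotes.

    Hypothetical helper for illustration only. Returns None if the title
    has no "image" property. File names containing ";" are not handled.
    """
    for prop in title.split(";"):
        prop = prop.strip()
        if prop.startswith("image "):
            # Drop the property name, then any surrounding quotes.
            return prop[len("image "):].strip().strip("\"'")
    return None


title = 'image "double page.png"; bbox 0 0 2000 1500; ppageno 0'
print(image_name_from_title(title))
```

Without the final `strip("\"'")`, the returned name would still carry the quotes, and opening the file would fail as the commit message describes.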
IMHO useful to include in master.
Done. Thank you, Philipp and Konstantin, for the contribution and the review.
Should we tag a new release based on master?
The script could also be extended to create two new hOCR files, one for the left page and one for the right page.
A new release sounds good, but there is already one drafted. Sorry, I forgot about this. Maybe we can do two new releases, 1.2.1 and 1.3.0? Improving the script sounds fine. I also expect that after cutting a double page into two single pages, it might be better to run OCR on each of them again.
Let's start with 1.2.1, then create 1.3.0. Running OCR again on the single pages is reasonable, but it can cost a lot of resources if many pages have to be processed, so separate hOCR files derived from the initial double pages can be desirable in certain situations.
This cuts a page (horizontally) into two pages in the middle so that most of the bounding boxes are separated cleanly, e.g. for cutting double pages or double columns.
For example, a double page is cut in the middle, producing a left and a right page.
The whole computation is based on the bounding boxes and therefore needs the output of an OCR or layout segmentation process as input. It may still be worthwhile to run OCR again on the individual pages afterwards to get better results (e.g. the skew might be more consistent within a single page than across a double page).
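As a rough illustration of a bounding-box-based cut, the sketch below searches a window around the middle of the page for the x position that crosses the fewest boxes, preferring positions closer to the middle. This is not the implementation in `hocr-cut`; the function names, the `search_fraction` parameter, and the tie-breaking rule are assumptions made for this example.

```python
def find_cut(bboxes, page_width, search_fraction=0.2):
    """Pick an x position near the page middle that crosses the fewest boxes.

    bboxes is a list of (x0, y0, x1, y1) tuples. search_fraction is the
    share of the page width, centred on the middle, that is searched.
    Both the name and the default value are assumptions of this sketch.
    """
    middle = page_width // 2
    radius = int(page_width * search_fraction / 2)
    best_x, best_key = middle, None
    for x in range(middle - radius, middle + radius + 1):
        # A box is crossed if the cut line lies strictly inside it.
        crossed = sum(1 for (x0, _, x1, _) in bboxes if x0 < x < x1)
        key = (crossed, abs(x - middle))
        if best_key is None or key < best_key:
            best_x, best_key = x, key
    return best_x


def split_boxes(bboxes, cut_x):
    """Assign each box to the left or right page by its horizontal centre."""
    left = [b for b in bboxes if (b[0] + b[2]) / 2 < cut_x]
    right = [b for b in bboxes if (b[0] + b[2]) / 2 >= cut_x]
    return left, right


# Two text lines per page; the left page's lines reach past the exact middle.
boxes = [(50, 100, 520, 130), (60, 140, 510, 170),
         (560, 100, 950, 130), (570, 140, 960, 170)]
cut = find_cut(boxes, page_width=1000)
left, right = split_boxes(boxes, cut)
print(cut, len(left), len(right))
```

In this example a cut at exactly x = 500 would slice through both left-hand lines, so the search moves the cut to the nearest position that crosses no box. The same left/right assignment could serve as a starting point for writing separate hOCR files per page, as discussed in the comments above.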